# reinstall packages if necessary

#install.packages("GGally")
# Load libraries
library(knitr)
library(class)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Load the built-in iris data set.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Analyze Relationships

To investigate relationships between variables within our dataset, we can generate a correlation matrix. This can be achieved using the pairs.panels() function from the PSYCH package. By creating this matrix, we can observe how different variables correlate with one another, providing valuable insights into potential associations and dependencies.
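Before plotting, the same correlations can be read off numerically with base R's cor() — a quick sketch, using only the built-in iris data:

```r
# Pairwise Pearson correlations of the four numeric measurements.
round(cor(iris[, 1:4]), 2)
# Sepal.Length vs Sepal.Width is the weakest pair (about -0.12), while
# Petal.Length vs Petal.Width is the strongest (about 0.96).
```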

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# Define custom colors for the Species levels.
custom_colors <- c("red", "black", "orange")


# Create the pairs.panels plot.
pairs.panels(
  iris[,1:4], 
  scale = TRUE, 
  hist.col = 'grey85', 
  bg = custom_colors[iris$Species],  # Use custom colors.
  pch = 21, 
  main = 'Correlation Matrix: Setosa(red), Versicolor(black), Virginica(orange)'
)

The top section of the correlation matrix shows the pairwise correlation coefficients. Most variable pairs are strongly correlated with one another, with the exception of "sepal length" and "sepal width," whose correlation is noticeably weaker.

Moving to the lower half of the matrix offers a more comprehensive view. Here, not only do we observe scatter plots illustrating these correlations, but we also benefit from a color-coded distinction between the data points based on iris species. This color differentiation allows us to discern distinct clusters or groupings among the various species, shedding light on the underlying patterns within the data.

Interactive 3D Visualization

Sepal width, the least significant factor, is removed from this visualization.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Define custom colors for the Species levels.
custom_colors <- c("red", "black", "orange")

# Create the 3D scatterplot with custom colors.
plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, z = ~Petal.Width,
        type = "scatter3d", mode = "markers",
        color = ~Species,
        marker = list(color = custom_colors[iris$Species])) %>%
  layout(
    scene = list(
      xaxis = list(title = 'Sepal length'),
      yaxis = list(title = 'Petal length'),
      zaxis = list(title = 'Petal width')
    )
  )

This visualization is quite informative when it comes to understanding the distribution of the species across the three variables, as it makes it easier to assess their proximity or separation. For instance, we can observe that the setosa species forms a distinct and isolated cluster, whereas the versicolor and virginica species, while still forming noticeable clusters, exhibit a slight overlap. These clearly distinguishable clusters are promising for our future machine learning tasks, as they are expected to assist our model in generating accurate predictions.

Box plot of sepal width:

# Create the box plot with custom fill colors.
ggplot(
  data = iris, 
  mapping = aes(x = Species, y = Sepal.Width, fill = Species)
) +
geom_boxplot() +
scale_fill_manual(values = custom_colors) +  # Specify custom fill colors here.
theme_light() +
labs(title = 'Box plot of sepal width for each species', 
     x = 'Species', y = 'Sepal width')

From this box plot it can be seen that the setosa species has a higher sepal width median and interquartile range compared to the other two species. In contrast, versicolor and virginica show quite a bit of overlap with each other in terms of their interquartile ranges. This will make it harder for a machine learning algorithm to distinguish between these two species when predicting using this variable.

Box plot of sepal length:

# Create the box plot with custom fill colors.
ggplot(
  data = iris,
  mapping = aes(x = Species, y = Sepal.Length, fill = Species)
) +
geom_boxplot() +
scale_fill_manual(values = custom_colors) +  # Specify custom fill colors here.
theme_light() +
labs(title = 'Box plot of sepal length for each species',
     x = 'Species', y = 'Sepal length')

The sepal length values for the three species show some degree of overlap in their ranges.

Box plot of petal width:

# Create the box plot with custom fill colors.
ggplot(
  data = iris,
  mapping = aes(x = Species, y = Petal.Width, fill = Species)
) +
geom_boxplot() +
scale_fill_manual(values = custom_colors) +  # Specify custom fill colors here.
theme_light() +
labs(title = 'Box plot of petal width for each species',
     x = 'Species', y = 'Petal width')

This box plot suggests that there is a noticeable variation in petal width among the different species.

Box plot of petal length:

# Create the box plot with custom fill colors.
ggplot(
  data = iris,
  mapping = aes(x = Species, y = Petal.Length, fill = Species)
) +
geom_boxplot() +
scale_fill_manual(values = custom_colors) +  # Specify custom fill colors here.
theme_light() +
labs(title = 'Box plot of petal length for each species',
     x = 'Species', y = 'Petal length')

This plot indicates that the three species differ in their petal-length interquartile ranges. The setosa species has a very narrow interquartile range and considerably shorter petals than the other two species.

Data partitioning

Random seed ensures that we can replicate our analysis consistently, enabling us to obtain the same results whenever we re-run our experiments.
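
A minimal illustration of what seeding guarantees (first_draw and second_draw are illustrative names): resetting the seed before each draw yields identical results.

```r
# Two draws with the same seed produce the same "random" sample.
set.seed(222)
first_draw <- sample(1:150, 5)
set.seed(222)
second_draw <- sample(1:150, 5)
identical(first_draw, second_draw)  # TRUE
```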

set.seed(222)

Split data 80% - 20% for training and testing.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
train_index <- createDataPartition(y = iris$Species,  # y = our dependent variable.
                                   p = .8,  # Specifies split into 80% & 20%.
                                   list = FALSE,  # Sets results to matrix form. 
                                   times = 1)  # Sets number of partitions to create to 1. 

Split data into train and test data using the randomly sampled train_index.

train_data <- iris[train_index,]  # Use train_index of iris data to create train_data.
test_data <- iris[-train_index,]  # Use whatever that is not in train_index to create test_data.
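
createDataPartition() samples within each level of Species, so the 80/20 split stays balanced across the three classes. A base-R sketch of an equivalent stratified split (train_rows, train_set, and test_set are illustrative names, not the objects used above):

```r
# Base-R sketch of a stratified 80/20 split: sample 80% of the row
# indices within each species separately, then combine them.
set.seed(222)
train_rows <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(rows) sample(rows, 0.8 * length(rows))
))
train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]
table(train_set$Species)  # 40 rows per species; test_set keeps 10 per species
```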

Machine Learning

Predict the species category (setosa, versicolor, virginica) for an iris flower.

Create Model

Set up 10-fold cross-validation for the decision tree model.

fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

Create a predictor model with the train() function from the CARET package. Specify method = 'rpart' to run a decision tree model.

# Create model
dt_model <- train(Species ~ ., # Set Y variable followed by '~'. The period indicates to include all variables for prediction. 
                     data = train_data, # Data
                     method = 'rpart', # Specify decision tree model
                     trControl = fitControl) # Use cross validation

Check the predicted accuracy of our decision tree model by running it on resamples of train data. Later test the accuracy of the model by running a prediction on test data.

confusionMatrix(dt_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       30.0       5.0
##   virginica     0.0        3.3      28.3
##                             
##  Accuracy (average) : 0.9167

The results here tell us that our average accuracy is 91.67% when testing our data on resamples of our training data. We can also see what was predicted correctly/incorrectly.

Check the importance of each feature in our model.

# Create object of importance of our variables 
dt_importance <- varImp(dt_model)

# Plot variable importance (caret supplies a ggplot method for varImp objects).
ggplot(dt_importance) +
  labs(title = "Variable importance: Decision tree model") +
  theme_light()

This plot gives an informative overview of the importance of each variable in predicting the species.

Petal.Width and Petal.Length are the most important predictors here.

Plot the decision tree using fancyRpartPlot() from the RATTLE package.

library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(dt_model$finalModel)

PREDICTION: Decision tree model

Use the created dt_model to run a prediction on the test data.

prediction_dt <- predict(dt_model, test_data)

Check the proportion of the predictions which were accurate.

table(prediction_dt, test_data$Species) %>% # Create prediction table. 
  prop.table() %>% # Convert table values into proportions instead of counts. 
  round(2) # Round numbers to 2 decimal places.
##              
## prediction_dt setosa versicolor virginica
##    setosa       0.33       0.00      0.00
##    versicolor   0.00       0.33      0.03
##    virginica    0.00       0.00      0.30
# Calculate accuracy from the prediction table.
prediction_table_dt <- table(prediction_dt, test_data$Species)
accuracy_dt <- sum(diag(prediction_table_dt)) / sum(prediction_table_dt)
accuracy_dt
## [1] 0.9666667

Final accuracy of Decision Tree model: 97%

A confusion matrix is used to evaluate the performance of a classification model by showing how well the model’s predictions align with the actual classes.

The values within the matrix represent the number of instances that were classified into each combination of actual and predicted classes. In this case, the numbers are presented as proportions (values between 0 and 1) rather than raw counts.

Overall, the diagonal elements (e.g., (setosa, setosa), (versicolor, versicolor), and (virginica, virginica)) represent correct predictions, while the off-diagonal elements represent misclassifications.

This confusion matrix provides information about the model’s performance, particularly in terms of precision, recall, and accuracy. You can compute various metrics such as accuracy, precision, recall, and F1-score based on the values in the confusion matrix to assess how well the predictions were made.
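
These metrics fall out of the confusion table directly; a base-R sketch using a small made-up table (the counts below are illustrative, not the model's results):

```r
# Build a toy confusion table (rows = predicted, columns = actual).
conf <- table(
  predicted = factor(c("setosa", "versicolor", "versicolor", "virginica"),
                     levels = c("setosa", "versicolor", "virginica")),
  actual    = factor(c("setosa", "versicolor", "virginica", "virginica"),
                     levels = c("setosa", "versicolor", "virginica"))
)

precision <- diag(conf) / rowSums(conf)  # correct / everything predicted as that class
recall    <- diag(conf) / colSums(conf)  # correct / everything actually in that class
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(conf)) / sum(conf) # overall share of correct predictions
accuracy  # 0.75 for this toy table
```

The same computation applies unchanged to the prediction tables produced later in this analysis.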

Use 10-fold cross-validation for the random forest model as well. Pass fitControl to the train() function via the trControl argument next.

fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

Create a training model using train() from the CARET package.

# Create model
rf_model <- train(
                  Species ~ .,  # Set Y variable followed by "~." to include all variables in formula.
                  method = 'rf',  # Set method as random forest.
                  trControl = fitControl,  # Set cross validation settings
                  data = train_data)  # Set data as train_data. 

Use the varImp() function to grab the importance of each variable.

# Create object of importance of our variables 
rf_importance <- varImp(rf_model) 

# Plot variable importance (caret supplies a ggplot method for varImp objects).
ggplot(rf_importance) +
  labs(title = "Variable importance: Random forest model") +
  theme_light()

Petal length and width are the most important variables.

Check the predicted accuracy of the random forest model by running it on resamples of the train data. Later test the accuracy of the model by running a prediction on test data.

confusionMatrix(rf_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       30.8       5.0
##   virginica     0.0        2.5      28.3
##                            
##  Accuracy (average) : 0.925

The predicted accuracy of the model is 92.5%.

Prediction: Random forest model

Use rf_model to run a prediction on the test data.

prediction_rf <- predict(rf_model, test_data)

Check the accuracy of random forest model on test data.

table(prediction_rf, test_data$Species) %>% # Create prediction table. 
  prop.table() %>% # Convert table values into proportions instead of counts. 
  round(2) # Round numbers to 2 decimal places.
##              
## prediction_rf setosa versicolor virginica
##    setosa       0.33       0.00      0.00
##    versicolor   0.00       0.33      0.03
##    virginica    0.00       0.00      0.30
# Calculate accuracy for random forests predictions.
prediction_table_rf <- table(prediction_rf, test_data$Species)
accuracy_rf <- sum(diag(prediction_table_rf)) / sum(prediction_table_rf)
accuracy_rf
## [1] 0.9666667

Final accuracy of Random Forest model: 97%

Fit the Naive Bayes model with 10-fold cross-validation.

fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

Create a predictor model with the train() function from the CARET package. Specify method = 'nb' to run a Naive Bayes model.

# Create model
nb_model <- train(Species ~ ., # Set y variable followed by '~'. The period indicates that we want to use all our variables for prediction.
                     data = train_data,
                     method = 'nb', # Specify Naive Bayes model
                     trControl = fitControl) # Use cross validation

Check the predicted accuracy of the Naive Bayes model by running it on resamples of the train data. Later test the accuracy of the model by running a prediction on test data.

confusionMatrix(nb_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       30.8       2.5
##   virginica     0.0        2.5      30.8
##                           
##  Accuracy (average) : 0.95

The average accuracy is 95% when testing on resamples of the training data.

Use the varImp() function to grab the importance of each variable.

# Create object of importance of our variables 
nb_importance <- varImp(nb_model) 

# Plot variable importance (caret supplies a ggplot method for varImp objects).
ggplot(nb_importance) +
  labs(title = "Variable importance: Naive Bayes model") +
  theme_light()

This plot gives an informative overview of the importance of each variable in predicting each species. Petal width and petal length are the two most important variables.

PREDICTION: Naive Bayes Model

Use the created nb_model to run a prediction on the test data.

prediction_nb <- predict(nb_model, test_data)

Check what proportion of the predictions were accurate.

table(prediction_nb, test_data$Species) %>% # Create prediction table. 
  prop.table() %>% # Convert table values into proportions instead of counts. 
  round(2) # Round numbers to 2 decimal places.
##              
## prediction_nb setosa versicolor virginica
##    setosa       0.33       0.00      0.00
##    versicolor   0.00       0.33      0.03
##    virginica    0.00       0.00      0.30
# Calculate accuracy from the prediction table.
prediction_table_nb <- table(prediction_nb, test_data$Species)
accuracy_nb <- sum(diag(prediction_table_nb)) / sum(prediction_table_nb)
accuracy_nb
## [1] 0.9666667

Final accuracy of Naive Bayes model: 97%

Fit the SVM model with 10-fold cross-validation.

fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

Create a predictor model with the train() function from the CARET package. Specify method = 'svmLinear' to run an SVM model.

# Create model
svm_model <- train(Species ~ ., # Set Y variable followed by '~'. The period indicates to include all variables for prediction. 
                     data = train_data, # Data
                     method = 'svmLinear', # Specify SVM model
                     trControl = fitControl) # Use cross validation

Check the predicted accuracy of the SVM model by running it on resamples of the train data. Later test the accuracy of the model by running a prediction on test data.

confusionMatrix(svm_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       29.2       0.8
##   virginica     0.0        4.2      32.5
##                           
##  Accuracy (average) : 0.95

The average accuracy is 95% when testing on resamples of the training data.

Use the varImp() function to grab the importance of each variable in the SVM model and then plot them.

# Create object of importance of our variables 
svm_importance <- varImp(svm_model)

# Plot variable importance (caret supplies a ggplot method for varImp objects).
ggplot(svm_importance) +
  labs(title = "Variable importance: Support vector machine model") +
  theme_light()

This plot gives an informative overview of the importance of each variable in predicting each species. Petal length and petal width are the two most important variables.

PREDICTION: Support Vector Machine

Use the created svm_model to run a prediction on the test data.

prediction_svm <- predict(svm_model, test_data)

Check the proportion of the predictions which were accurate.

table(prediction_svm, test_data$Species) %>% # Create prediction table. 
  prop.table() %>% # Convert table values into proportions instead of counts. 
  round(2) # Round numbers to 2 decimal places.
##               
## prediction_svm setosa versicolor virginica
##     setosa       0.33       0.00      0.00
##     versicolor   0.00       0.33      0.03
##     virginica    0.00       0.00      0.30
# Calculate accuracy for the SVM model from the prediction table.
prediction_table_svm <- table(prediction_svm, test_data$Species)
accuracy_svm <- sum(diag(prediction_table_svm)) / sum(prediction_table_svm)
accuracy_svm
## [1] 0.9666667

Final accuracy of SVM model: 97%

Table of results:

Machine learning model    Predicted accuracy    Tested accuracy
Decision tree             91.67%                97%
Random Forest             92.5%                 97%
Naive Bayes               95%                   97%
Support Vector Machine    95%                   97%
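
The same summary can be assembled in code rather than by hand; a sketch with the values copied from the confusionMatrix() and accuracy_* outputs above (results is an illustrative name):

```r
# Collect the reported accuracies into a single summary data frame.
results <- data.frame(
  model              = c("Decision tree", "Random Forest",
                         "Naive Bayes", "Support Vector Machine"),
  predicted_accuracy = c(0.9167, 0.9250, 0.9500, 0.9500),
  tested_accuracy    = c(0.9667, 0.9667, 0.9667, 0.9667)
)
results
```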